1. INTRODUCTION

Outliers are unusual, unexpected patterns in observed data. They occur widely in real-world data and arise
from different sources, such as heavy-tailed distributions or data entry errors. While there is no single,
generally accepted formal definition of an outlier, Hawkins’ definition captures the spirit: “an outlier is an
observation that deviates so much from other observations as to arouse suspicions that it was generated by a
different mechanism” [1]. Anomaly detection is an important problem that has been studied in diverse research
areas and application domains such as fraud detection [2], intrusion discovery [3], video surveillance,
pharmaceutical testing and weather prediction. Several surveys cover classical outlier and anomaly detection
methods, which range from density-based approaches [3] and statistical [4] and distance-based [5] methods to
neural networks and machine learning techniques.

Recent studies on outlier detection have focused on examining the nearest-neighbor structure of a data
object to measure its degree of outlierness [6-7]. Such techniques rest on the key assumption that normal
instances occur in dense neighborhoods, while outliers lie far from their closest neighbors [8]. Popular
outlier detection methods require pairwise comparison of objects to compute the nearest neighbors. This
quadratic cost does not scale to large data sets, so outlier detection for large-scale data remains an open
challenge. This paper proposes a fast outlier detection method for large-scale datasets, which consists of two
steps: a granulation of the universe into parts with the same properties, followed by the computation of a
degree of outlierness, called the fuzzy neighborhood rough set outlier factor (FNROF), for each granule formed.
Granulation of the observable universe involves grouping similar elements into granules. With granulated
views, we deal with approximations of concepts, represented by subsets of the universe, in terms of granules
[9]. The remainder of this paper is organized as follows. Section 2 presents the preliminaries of rough set
theory that are relevant to this paper, discusses the granularity of knowledge in connection with rough and
fuzzy sets, and proposes an efficient parallel computing system based on MapReduce that speeds up the
computation and handles more complex outlier detection problems for large-scale data. Section 3 reports the
experiments and results, and Section 4 concludes.


2. ROUGH SETS (RST) 

Rough set theory (RST) [10-11] is a mathematical approach to imperfect knowledge. The theory
has attracted the attention of many researchers and practitioners all over the world, who have contributed
substantially to its development and applications. The main advantage of rough set theory in data analysis is
that it does not need any preliminary or additional information about the data. Rough set theory is a popular
and powerful machine learning tool, and it is especially suitable for information systems that exhibit
inconsistencies. In rough set theory, an information table is defined as a tuple $T = (U, A)$ where $U$ and $A$
are two finite, non-empty sets, with $U$ the universe of primitive objects and $A$ the set of attributes. Each
attribute or feature $a \in A$ is associated with a set $V_a$ of its values, called the domain of $a$. We may
partition the attribute set $A$ into two subsets $C$ and $D$, called condition and decision attributes,
respectively. Let $P \subseteq A$ be a subset of attributes. The indiscernibility relation, denoted by
$\mathrm{IND}(P)$, is defined as:


$$\mathrm{IND}(P) = \{(x, y) \in U^2 \mid \forall a \in P,\ a(x) = a(y)\} \tag{1}$$

where $a(x)$ denotes the value of feature $a$ of object $x$. If $(x, y) \in \mathrm{IND}(P)$, $x$ and $y$ are
said to be indiscernible with respect to $P$. The family of all equivalence classes of $\mathrm{IND}(P)$,
referring to a partition of $U$ determined by $P$, is denoted by $U/\mathrm{IND}(P)$. Each element in
$U/\mathrm{IND}(P)$ is a set of indiscernible objects with respect to $P$. The partition can be computed as
$U/\mathrm{IND}(P) = \otimes\{U/\mathrm{IND}(\{a\}) : a \in P\}$, where

$$A \otimes B = \{X \cap Y \mid X \in A,\ Y \in B,\ X \cap Y \neq \emptyset\} \tag{2}$$


For any concept $X \subseteq U$, $X$ can be approximated by the P-lower and P-upper approximations
using the knowledge of $P$. The lower approximation of $X$ is the set of objects of $U$ that are surely in $X$:

$$\underline{P}(X) = \bigcup \{E \in U/\mathrm{IND}(P) : E \subseteq X\} \tag{3}$$

The upper approximation of $X$ is the set of objects of $U$ that are possibly in $X$, defined as:

$$\overline{P}(X) = \bigcup \{E \in U/\mathrm{IND}(P) : E \cap X \neq \emptyset\} \tag{4}$$

The set of objects that can possibly, but not certainly, be classified in a specific way is called the
boundary region, which is defined as $BN(P) = \overline{P}(X) - \underline{P}(X)$, as shown in Figure 1.


[Figure: set X together with its lower approximation and its upper approximation]

Figure 1. Representation of the data partitioning for a subset X
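
The definitions in equations (1), (3) and (4) can be illustrated with a minimal Python sketch. It partitions a toy information table into indiscernibility classes and derives the lower approximation, the upper approximation and the boundary region. The table contents and the helper names `partition` and `approximations` are assumptions made for this example, not part of the proposed method.

```python
from collections import defaultdict

def partition(table, attrs):
    """Group object ids into the equivalence classes of IND(attrs)."""
    classes = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)  # same key: indiscernible objects
        classes[key].add(obj)
    return list(classes.values())

def approximations(table, attrs, concept):
    """Return (lower, upper, boundary) of `concept` with respect to `attrs`."""
    lower, upper = set(), set()
    for eq_class in partition(table, attrs):
        if eq_class <= concept:   # class lies entirely inside X
            lower |= eq_class
        if eq_class & concept:    # class overlaps X
            upper |= eq_class
    return lower, upper, upper - lower

# Hypothetical information table: object id -> attribute values
table = {
    1: {"a": 0, "b": 1},
    2: {"a": 0, "b": 1},
    3: {"a": 1, "b": 0},
    4: {"a": 1, "b": 1},
}
X = {1, 3}
lower, upper, boundary = approximations(table, ["a", "b"], X)
print(lower, upper, boundary)
```

Objects 1 and 2 share the same attribute values, so object 1 cannot be placed in X with certainty; this is what pushes the whole class into the boundary region.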
2.1. Rough set and fuzzy discretization

Extracting knowledge from a huge volume of data with rough set methods requires transforming
continuous-valued attributes into discrete intervals, in order to form a grid structure and then form
clusters from the cells of the grid. Clusters correspond to regions that are denser in data points than
their surroundings. The great advantage of grid-based clustering is a significant reduction in time
complexity, especially for very large data sets. It is well known that one of the premises of classical rough
set theory is that the information or data be discrete. Discretization can be viewed as a data reduction
technique that reduces the range of values of a continuous attribute to a minimum number of discrete
intervals. The number of cut-points determines the level of data reduction: the fewer the cut-points, the more
the data is reduced and hence the more general the resulting classifier. The term “cut-point”, also known as
split-point, refers to a real value within the range of continuous values that divides the range into
intervals. During the discretization process, however, if the discretization is too coarse, much useful
information may be lost, and if it is too fine, the computation becomes time-consuming. The disadvantages of
classical rough sets are therefore a strong dependence on the quality of the discretization method and a
limited application domain.

Let $X = (x_1, x_2, \ldots, x_n)$ be a given dataset with $n$ objects and $A$ attributes, and let
$v_{\min,i} = \min(x_i)$ and $v_{\max,i} = \max(x_i)$ be the minimum and maximum values of attribute $i$. The
range $[v_{\min,i}, v_{\max,i}]$ of each attribute is divided into $M$ equal intervals of width
$w_i = (v_{\max,i} - v_{\min,i})/M$. The set of all initial interval boundaries of attribute $i$ is:

$$Interv_i = \{v_{\min,i},\ v_{\min,i} + w_i,\ v_{\min,i} + 2 w_i,\ \ldots,\ v_{\max,i}\}$$
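
The equal-width discretization described above can be sketched in a few lines of Python. The sample values and the function names `equal_width_cuts` and `cell_index` are hypothetical and chosen only for illustration; the rule that the maximum value falls in the last interval is also an assumption.

```python
def equal_width_cuts(values, m):
    """Return the m + 1 interval boundaries of an equal-width discretization."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / m
    return [v_min + k * width for k in range(m)] + [v_max]

def cell_index(value, v_min, width, m):
    """Map a value to its interval index in 0..m-1 (v_max goes to the last interval)."""
    if width == 0:
        return 0
    return min(int((value - v_min) / width), m - 1)

temperatures = [2.0, 4.5, 7.0, 9.5, 12.0]  # hypothetical attribute values
m = 4
cuts = equal_width_cuts(temperatures, m)
width = (max(temperatures) - min(temperatures)) / m
indices = [cell_index(t, min(temperatures), width, m) for t in temperatures]
print(cuts)
print(indices)
```

Applying `cell_index` to every attribute of an object gives the coordinates of its grid cell.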


2.2. Fuzzy rough sets 

Fuzzy rough set theory extends rough set theory to data with continuous attributes and detects degrees
of inconsistency in the data. The key is to turn the indiscernibility relation into a gradual relation. A
fuzzy set is fundamentally broader than a classical (crisp) set: a classical set only admits the membership
degrees 0 and 1, whereas a fuzzy set admits any degree of membership in [0, 1], as shown in Figure 2.


Definition 1 (Fuzzy sets): A fuzzy set $F$ defined over a universe $X$ is a function defined as:

$$F = \{(x, \mu(x)) \mid \mu(x) \in [0, 1],\ \forall x \in X\} \tag{5}$$

The function $\mu(x)$ is called the membership function, which maps object $x$ to the membership space. The
rough membership function expresses the conditional probability that $x$ belongs to $X$ given $P$ and can be
interpreted as the degree to which $x$ belongs to $X$. One of the most important concepts in fuzzy set theory
and its applications is the α-cut decomposition theorem, developed by Zadeh in 1971 under the name resolution
identity. These cuts are crisp sets associated with certain levels $\alpha$ that represent distinct grades of
membership.


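Definition 2 (α-cut of a fuzzy set): Given a number $\alpha \in [0, 1]$, the α-cut, or α-level set, of a fuzzy
set $F$ is defined by:

$$F_\alpha = \{(x, \mu(x)) \mid \mu(x) \geq \alpha\},\ \alpha \in [0, 1]; \quad \text{if } \alpha_1 < \alpha_2 \text{ then } F_{\alpha_1} \supseteq F_{\alpha_2} \tag{6}$$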


Figure 2. (Alpha, Beta)-cuts of fuzzy set F 


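We define the membership function of the intersection of two fuzzy sets $A = (x, \mu_A(x))$ and
$B = (x, \mu_B(x))$ as:

$$(A \cap B)_\alpha = \left(x,\ \mu_{A_\alpha \cap B_\alpha}(x) = \tfrac{1}{2}\left(\mu_{A_\alpha}(x) + \mu_{B_\alpha}(x)\right)\right), \quad x \in X$$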
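
As a small illustration of Definition 2 and of the averaged intersection above, the following Python sketch computes α-cuts of a fuzzy set stored as a dictionary and the mean membership on the elements shared by two fuzzy sets. The membership values and the function names are hypothetical; the averaging rule follows the formula given in the text rather than the more common minimum operator.

```python
def alpha_cut(fuzzy_set, alpha):
    """Crisp set of elements whose membership is at least alpha."""
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}

def averaged_intersection(a, b):
    """Membership of A ∩ B on shared elements: mean of the two memberships."""
    return {x: 0.5 * (a[x] + b[x]) for x in a.keys() & b.keys()}

# Hypothetical membership values
A = {"x1": 0.9, "x2": 0.25, "x3": 0.1}
B = {"x1": 0.5, "x2": 0.75, "x4": 0.7}

cut_low = alpha_cut(A, 0.2)    # lower level: larger crisp set
cut_high = alpha_cut(A, 0.5)   # higher level: smaller crisp set
both = averaged_intersection(A, B)
print(cut_low, cut_high, both)
```

Raising the level can only shrink the cut, which is the nesting property stated in equation (6).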
2.3. Rough sets: neighborhood systems

The concept of information granulation was first introduced by Zadeh in the context of fuzzy sets in
1979 [12]. The basic ideas of information granulation have appeared in fields such as interval analysis,
quantization, rough set theory and many others. There is fast-growing and renewed interest in the study of
information granulation and computation under the term Granular Computing (GrC) [13]. Granulation of a
universe involves grouping similar elements, or individual objects, into a family of disjoint subsets, based
on available information and knowledge. The combination of topological spaces and rough sets and the
properties of topological rough spaces are discussed in [14], which uses neighborhood systems and topological
concepts in the study of approximations. A neighborhood system is a mathematical structure of granular
computing that models granules and can be used to compute the structure within and between granules. A
neighborhood system at a point is a framework to capture the concept of “near” objects, and any subset of
objects can be approximated by a set of neighborhoods. A neighborhood system defines a set of binary
relations, and conversely a set of binary relations can be used to define a neighborhood system.


Definition 3 (neighborhood of object $x_i$): Given an arbitrary $x_i \in U$ and $P \subseteq C$, the
neighborhood $\delta_s^P(x_i)$ of $x_i$ in feature space $P$ is defined as:

$$\delta_s^P(x_i) = \{x_j \mid \Delta^P(x_i, x_j) \leq s,\ s \in \mathbb{R}^+\} \tag{7}$$

where $\Delta: U \times U \to \mathbb{R}^+$ is a distance (similarity) function and $\mathbb{R}^+$ is the set
of non-negative real numbers. $\delta_s^P(x_i)$ is the neighborhood information granule containing object
$x_i$, and the size of the neighborhood depends on the threshold $s$.

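For each value of $s \in \mathbb{R}^+$, we propose the following neighborhood system as the collection of all
neighborhoods of $x \in U$:

$$N_s^P(x) = \{\delta_s^P(x) \mid s \in \mathbb{R}^+,\ P \subseteq C\} \tag{8}$$

where $s$ is the slide of the sliding window used for overlapping computation, with $s < M$.

Theorem 1: Let $P_1 \subseteq A$ and $P_2 \subseteq A$, and let $N_s^P(x)$ be the neighborhood relation induced
in feature subspace $P$. We have:

$$N_s^{P_1 \cup P_2}(x) = N_s^{P_1}(x) \cap N_s^{P_2}(x), \quad \text{and if } A = \bigcup_i P_i \text{ then } N_s^A(x) = \bigcap_i N_s^{P_i}(x) \tag{9}$$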


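Given a set of objects $U$ and a neighborhood system $N_s$ over $U$, we call $\langle U, N_s \rangle$ a
neighborhood approximation space. The lower and upper approximations $(\underline{N}X, \overline{N}X)$ of $X$
in $\langle U, N_s \rangle$ are defined as:

$$\underline{N}X = \bigcup_{N_s(x) \subseteq X} N_s(x), \qquad \overline{N}X = \bigcup_{N_s(x) \cap X \neq \emptyset} N_s(x)$$

Obviously, $\underline{N}X \subseteq X \subseteq \overline{N}X$. The boundary region of $X$ in the
approximation space is defined as:

$$BNX = \overline{N}X - \underline{N}X$$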


The size of the boundary region reflects the degree of roughness of the set $X$ in the approximation space
$\langle U, N_s \rangle$. Assuming $X$ is the sample subset with a decision label, we generally want the
boundary region of the decision to be as small as possible in order to decrease the uncertainty of the
decision. The size of the boundary region depends on $X$ and on the attributes used to describe $U$.
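
The neighborhood approximations can be sketched as follows. This is a minimal illustration that assumes the Chebyshev (maximum) distance for $\Delta$ and a fixed radius `delta`; the data points and the function names are hypothetical.

```python
def neighborhood(points, i, delta):
    """Indices of the points within Chebyshev distance delta of point i."""
    xi = points[i]
    return {
        j for j, xj in enumerate(points)
        if max(abs(a - b) for a, b in zip(xi, xj)) <= delta
    }

def neighborhood_approximations(points, concept, delta):
    """Union of neighborhoods inside X (lower) and of those meeting X (upper)."""
    lower, upper = set(), set()
    for i in range(len(points)):
        n = neighborhood(points, i, delta)
        if n <= concept:
            lower |= n
        if n & concept:
            upper |= n
    return lower, upper, upper - lower

points = [(0.0,), (0.1,), (0.2,), (0.9,), (1.0,)]  # hypothetical 1-D data
X = {0, 1, 3}                                      # concept given as point indices
lower, upper, boundary = neighborhood_approximations(points, X, delta=0.15)
print(lower, upper, boundary)
```

A larger `delta` merges more points into each neighborhood, which tends to shrink the lower approximation and enlarge the boundary region.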

For a fixed pair of numbers $(\alpha_1, \alpha_2) \in [0, 1] \times [0, 1]$, we obtain a submodel in which a
crisp set $F_\alpha$ is approximated in a crisp approximation space $apr_{\alpha_1 \alpha_2}$ over $U$. The
result is a rough set $(\underline{apr}_{\alpha_1 \alpha_2}(F), \overline{apr}_{\alpha_1 \alpha_2}(F))$ with
the reference set $F$. Each granule in the fuzzy set $F$ is a neighborhood of an element of the universe. The
approximations, illustrated in Figure 3, are defined as:

$$\underline{N}A = \bigcup_{N_s(x) \subseteq F_{\alpha_1}} N_s(x), \quad \text{for } F_{\alpha_1} \subseteq F_{\alpha_2} \subseteq U \tag{10}$$

$$\overline{N}A = \bigcup_{N_s(x) \cap F_{\alpha_2} \neq \emptyset} N_s(x) \tag{11}$$


In this case, the subset $F_{\alpha_1}$ (lower approximation) contains two clusters C1 (grid 2) and C2
(grid 3): $F_{\alpha_1} = C_1 \cup C_2$.

Figure 3. Fuzzy rough set approximation

The root grid, Grid 0 (universe U), has the coarsest granularity and covers the entire dataset. It contains
one sub-grid, grid 1 (upper approximation $F_{\alpha_2}$), at level 1, which in turn contains two sub-grids at
level 2 (lower approximation $F_{\alpha_1}$).


2.4. Fuzzy neighborhood rough set outlier factor (FNROF) 

In this paper, we propose a new method for ranking outliers based on fuzzy rough sets, denoted the
“fuzzy neighborhood rough set outlier factor” (FNROF). After dividing each dimension into $M$ intervals of
equal length, the density distribution of each cell (information granule) can be defined as the ratio of its
density to the average density of its $k$ neighboring cells.


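$$\varsigma_i^P = \sum_{j=1}^{n} \frac{d_i}{d_j} \log\frac{d_i}{d_j} = \sum_{j=1}^{n} \frac{n_i}{n_j} \log\frac{n_i}{n_j} \tag{12}$$

where $d_i$ and $n_i$ are the density and the number of points of cell $cl_i$, $d_j$ and $n_j$ are those of
its $j$-th neighboring cell, and $d_i/d_j = n_i/n_j$ because all cells have the same size.

A normalized score $\tilde{\varsigma}_i^P$ is given as follows:

$$\tilde{\varsigma}_i^P = 1 - \frac{\varsigma_{\max}^P - \varsigma_i^P}{\varsigma_{\max}^P - \varsigma_{\min}^P}$$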


It is viewed as the relative density of $cl_i$ ($d_i$) with respect to the density of its $n$ surrounding
neighbor cells. When the probability is uniformly distributed, we are most uncertain about the outcome and
the entropy (score) is highest. On the other hand, when the data points have a highly peaked probability mass
function, we know that the variable is likely to fall within a small set of outcomes, so the uncertainty and
the entropy (score) are low. The interval size must be carefully selected. If the interval size is too small,
there will be many cells and the average number of points in each cell may be too small. On the other hand, if
the interval size is too large, we may not be able to capture the differences in density between different
regions of the space. Unfortunately, without knowing the distribution of the data, it is difficult to estimate
the minimal average number of points required in each cell to obtain a correct result.
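
The trade-off on the interval size can be made concrete with a short calculation. With $M$ intervals per dimension, a grid over $d$ attributes has $M^d$ cells, so the average occupancy falls quickly as $M$ grows. The sketch below takes the dataset size from Table 1; the choice of three dimensions and the function name are assumptions made for the example.

```python
def average_points_per_cell(n_points, m_intervals, n_dims):
    """Average occupancy of a grid with m_intervals per dimension."""
    return n_points / (m_intervals ** n_dims)

n = 19768  # one of the dataset sizes listed in Table 1
occupancy = {m: average_points_per_cell(n, m, n_dims=3) for m in (5, 10, 20)}
for m, avg in occupancy.items():
    print(f"M = {m:2d}: {avg:8.2f} points per cell on average")
```

Doubling $M$ in three dimensions divides the average occupancy by eight, which is why a fine grid can leave most cells nearly empty.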


Definition 4 (Directly density-reachable): A cell $cl_i$ is directly density-reachable from a cell $cl_j$ if
and only if $\varsigma_j^P \geq \beta$ and $cl_i \in N(cl_j)$, where
$\Delta^P(cl_i, cl_j) = \varsigma_i^P - \varsigma_j^P$.

Definition 5 (Density-connected): A cell $cl_i$ is density-connected to a cell $cl_j$ if there is a cell $cl_k$
such that both $cl_i$ and $cl_j$ are density-reachable from $cl_k$, as shown in Figure 4.


Figure 4. The concept of density-reachability and density-connectivity to form clusters as contiguous dense 
regions in lower approximation 


2.5. A novel approach: high-performance parallel and distributed computation using MapReduce

To compute an optimal set of cut-points, most discretization algorithms perform an iterative
search in the space of candidate discretizations, using different types of scoring functions to evaluate a
discretization, which takes a lot of time. In this paper, we propose a parallel discretization process based
on MapReduce using a sliding grid. A sliding grid is specified by its range $M$ and slide $s$: the range $M$ is
the discretization interval, while the slide $s$ specifies the portion of the grid that is moved forward. A
sliding window is thus specified as a tuple $(M, s)$. A smooth sliding specification, in which the slide $s$
is small relative to the range $M$ ($s < M$), is highly desirable. The proposed MapReduce-based algorithm,
computed at each node $i$ ($P_i \subseteq A$), is a parallel process that consists of three steps: map,
shuffle, and reduce, as shown in Figure 5.


[Figure: Node 1 (P1), Node 2 (P2) and Node 3 (P3) each process the slides s1 to s5; their outputs are combined in a Reduce step]

Figure 5. Proposed MapReduce framework


Example: $S = \langle U, A = \{C1, C2, C3, C4, C5\} \rangle$, with

P1 = {C1, C2}
P2 = {C2, C3}
P3 = {C3, C4, C5}

($P_i \subseteq A$ and $A = P_1 \cup P_2 \cup P_3$)

At node 1 (P1), each worker node applies the map function to each grid defined by the tuples
{(M, s1), (M, s2), (M, s3), (M, s4), (M, s5)}.

In the map phase, for each grid given by a tuple $(M, s)$, we generate a list of pairs
(key = $cl_i$, value = $\varsigma_i^P$), where $\varsigma_i^P$ is the score of $cl_i$. In the shuffle phase,
the output pairs are partitioned and then transferred to the reducers. In the reduce phase, pairs with the
same key are grouped together as $(cl_i, list(\varsigma_i^P))$, as shown in Figure 6.


Figure 6. Cell overlap when the grid moves


Then the reduce function generates the final list of output pairs $(cl_k, \varsigma_k^P)$ for each fuzzy
approximation. The whole process can be summarized as follows:

Map: $(M, s) \rightarrow (cl_i, \varsigma_i^P)$

Reduce: $(cl_i, list(\varsigma_i^P)) \rightarrow (cl_k, \varsigma_k^P)$
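
The map, shuffle and reduce steps can be mimicked on a single machine with plain Python, as in the following sketch. It is only an illustration under simplifying assumptions: the data are one-dimensional, the value emitted for a cell is its point count rather than the score $\varsigma_i^P$, cells are keyed by their index in the shifted grid, and the reducer averages the values it receives. All names and data are hypothetical.

```python
from collections import defaultdict

def map_phase(points, width, slide):
    """Emit (cell index, point count) pairs for a 1-D grid shifted by `slide`."""
    counts = defaultdict(int)
    for x in points:
        counts[int((x - slide) // width)] += 1
    return list(counts.items())

def shuffle_phase(pairs):
    """Group values by key, as the shuffle step does before the reducers run."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Average the values collected for each cell over all grid positions."""
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}

points = [1.5, 2.2, 2.8, 3.1, 3.4, 3.6, 8.9]  # hypothetical 1-D data
width, slides = 2.0, [0.0, 0.5, 1.0]          # range M and slides s < M

emitted = [pair for s in slides for pair in map_phase(points, width, s)]
result = reduce_phase(shuffle_phase(emitted))
print(result)
```

In a real deployment each call to `map_phase` would run on a separate worker and the grouping would be done by the framework.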

A parallel computation of FNROF and its template implementation

Master:
    Get {$cl_i$, $\varsigma_i^P$, ListCl_i} from the result queue
    if ($\varsigma_i^P \geq \beta$)
        for each cell $cl_k$ in ListCl_i
            if ($\varsigma_k^P \geq \beta$ and $cl_k$ is not labeled)
            {
                C_clustID = C_clustID ∪ $cl_k$
                Label($cl_k$) = clustID
                put $cl_k$ in the candidate queue
            }

Slave:
    Get cell $cl_i$ from the candidate queue
    ListCl_i = neighborhood($cl_i$)
    put {$cl_i$, $\varsigma_i^P$, ListCl_i} in the result queue


Algorithm MR-FNROF: Fast outlier detection algorithm based on fuzzy neighborhood rough sets and pipeline
parallelism between the master and slave modules.

clustID = 0
for each cell $cl_i$ in grid database
{
    if ($cl_i$ is not labeled)
    {
        if ($\varsigma_i^P < \beta$)
        {
            NEG = NEG ∪ $cl_i$
            Label($cl_i$) = noise
        } else if ($\varsigma_i^P < \alpha$)
        {
            NEG = NEG ∪ $cl_i$
            Label($cl_i$) = boundary
        } else {
            while (there are pending results)
            {
                Master in (neighbor $cl_i$) out candidate $cl_k$
                Parallel: Slave in (candidate $cl_k$) out (neighbor $cl_i$)
            }
            clustID = clustID + 1
        }
    }
}

Example of computation (at a single node P):

Map phase:

Cell i: $cl_i$ (center cell of the grid of point counts)
20 40 10
70 90 100
80 99 102

$\varsigma_i^P$ = (90/20)*log(90/20) + (90/40)*log(90/40) + (90/10)*log(90/10) + (90/70)*log(90/70)
+ (90/100)*log(90/100) + (90/80)*log(90/80) + (90/99)*log(90/99) + (90/102)*log(90/102)

$\varsigma_i^P$ = 28.53

Cell k: $cl_k$ after moving the grid
24 44 14
73 94 105
84 103 107

$\varsigma_k^P$ = (94/24)*log(94/24) + (94/44)*log(94/44) + (94/14)*log(94/14) + (94/73)*log(94/73)
+ (94/105)*log(94/105) + (94/84)*log(94/84) + (94/103)*log(94/103) + (94/107)*log(94/107)

$\varsigma_k^P$ = 19.90

$\varsigma_{\max}^P$ = 80

$\tilde{\varsigma}_i^P$ = 0.36, $\tilde{\varsigma}_k^P$ = 0.25
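
The map-phase arithmetic can be checked with a few lines of Python. The sketch evaluates the sum of equation (12) for the center cell of each 3x3 grid, taking `log` as the natural logarithm, which gives values close to the quoted scores. Dividing by the maximum score of 80 assumes a minimum score of 0; the function name is hypothetical.

```python
import math

def score_from_grid(grid):
    """Score of the center cell of a 3x3 grid of point counts (natural log)."""
    center = grid[1][1]
    neighbors = [n for r, row in enumerate(grid)
                 for c, n in enumerate(row) if (r, c) != (1, 1)]
    return sum((center / n) * math.log(center / n) for n in neighbors)

grid_i = [[20, 40, 10], [70, 90, 100], [80, 99, 102]]
grid_k = [[24, 44, 14], [73, 94, 105], [84, 103, 107]]

s_i, s_k = score_from_grid(grid_i), score_from_grid(grid_k)
s_max = 80.0  # maximum score quoted in the example; minimum assumed to be 0
print(round(s_i, 2), round(s_k, 2))
print(round(s_i / s_max, 2), round(s_k / s_max, 2))
```

Neighbors that are much sparser than the center cell dominate the sum, so the cell with count 10 (or 14) contributes most of each score.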


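Shuffle and Reduce phase:

Given a cut point $\alpha_0 = 0.3$:

$\tilde{\varsigma}_{cl_i - cl_k} = \tilde{\varsigma}_{cl_i} = 0.36 > \alpha_0$
$\tilde{\varsigma}_{cl_k - cl_i} = \tilde{\varsigma}_{cl_k} = 0.25 < \alpha_0$
$\tilde{\varsigma}_{cl_i \cap cl_k} = \tfrac{1}{2}\left(\tilde{\varsigma}_{cl_i} + \tilde{\varsigma}_{cl_k}\right)$
$\tilde{\varsigma}_{cl_i \cap cl_k} = \tfrac{1}{2}(0.36 + 0.25) = 0.305 > \alpha_0$

Lower approximation:

$(cl_i \cap cl_k) \subseteq \underline{X}$
$(cl_i - cl_k) \subseteq \underline{X}$
$\underline{X} = \underline{X} + (cl_i - cl_k) + (cl_i \cap cl_k)$

Upper approximation:

$(cl_i \cup cl_k) \subseteq \overline{X}$
$\overline{X} = \overline{X} + (cl_i \cup cl_k)$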
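
The reduce-phase decision of this example can be written as a short Python sketch. It splits two overlapping cells into three regions, averages the scores on the overlap, and assigns the regions above the cut point to the lower approximation. The rule used for the upper approximation (keep the whole union when at least one region passes the cut) is an assumption drawn from this single example, and the region names are hypothetical labels.

```python
def classify_regions(score_i, score_k, cut):
    """Assign the three regions of two overlapping cells to the approximations."""
    regions = {
        "cl_i - cl_k": score_i,
        "cl_k - cl_i": score_k,
        "cl_i & cl_k": 0.5 * (score_i + score_k),  # averaged score on the overlap
    }
    lower = {name for name, s in regions.items() if s > cut}
    # Assumption: the union is possibly in X once any region passes the cut.
    upper = set(regions) if lower else set()
    return regions, lower, upper

regions, lower, upper = classify_regions(0.36, 0.25, cut=0.3)
print(regions["cl_i & cl_k"], lower, upper)
```

The part of `cl_k` outside the overlap stays out of the lower approximation, so it ends up in the boundary region.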
3. EXPERIMENTS AND RESULTS

The proposed algorithm is tested with synthetic data and with real data collected from the NOAA center.
The implementation was carried out in R using RStudio. NOAA dataset [15]: the National Climatic
Data Center (NOAA) collects a wide range of data, including sensor streams with temporal information, sensor
spatial information, temperature, etc.


3.1. Improvement in search time efficiency 

The purpose of the experiment was to compare the performance of the proposed MR-FNROF algorithm with that
of the original LOF algorithm in terms of matching detected outliers and execution time. The comparison shows
that our method has a much shorter processing time (for example, 8.49 s versus 523.4 s for 938419 objects)
with an acceptable trade-off in errors, as shown in Table 1.


Table 1. Time taken and number of outliers detected according to the number of objects in the dataset for both
MR-FNROF and LOF methods

Number of    Time taken (seconds)            Number of outliers detected
objects      MR-FNROF (9 nodes)   LOF        MR-FNROF (9 nodes)   LOF
2023         0.29                 5.37       203                  123
4845         0.34                 11.3       302                  284
19768        1.9                  50.2       713                  688
938419       8.49                 523.4      2023                 1987
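
The speedups implied by Table 1 can be derived with a short calculation. The sketch below uses only the figures from the table; the variable names are arbitrary.

```python
rows = [  # (objects, MR-FNROF seconds, LOF seconds, MR-FNROF outliers, LOF outliers)
    (2023, 0.29, 5.37, 203, 123),
    (4845, 0.34, 11.3, 302, 284),
    (19768, 1.9, 50.2, 713, 688),
    (938419, 8.49, 523.4, 2023, 1987),
]

for n, t_mr, t_lof, out_mr, out_lof in rows:
    speedup = t_lof / t_mr
    ratio = out_mr / out_lof
    print(f"{n:>7}  speedup {speedup:6.1f}x  outlier ratio {ratio:.2f}")

speedups = [t_lof / t_mr for _, t_mr, t_lof, _, _ in rows]
```

The speedup ranges from roughly 18 times on the smallest dataset to roughly 62 times on the largest one.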


3.2. Performance of MR-FNROF according to the number of worker nodes

The second experiment shows that the risk of Type I and Type II errors is reduced by increasing the
number of worker nodes, as shown in Figure 7. With a higher number of worker nodes, more outliers are
detected in the upper approximation of the rough set (fewer Type II errors).


[Figure: three scatter plots of x[, 2] against x[, 1], for 3, 5 and 7 worker nodes]

Figure 7. Anomaly detection using successively 3, 5 and 7 worker nodes given
(alpha, beta)-cuts = (20%, 50%)
4. CONCLUSION

The aim of this paper is to propose a new outlier detection algorithm that reduces the required
computation time by using granular computing and fuzzy rough set theory. The MR-FNROF algorithm
divides the universe into a smaller number of granules and calculates the outlierness factor of each granule.
To examine the effectiveness of the proposed method, several experiments with different parameters
were conducted. The proposed MR-FNROF method demonstrated a significant reduction in computation time.
Moreover, it can also be used effectively for real-time outlier detection.